Abstract: The entire publicly available set of 37 genome sequences from the bacterial
order Chlamydiales has been subjected to comparative analysis in order to reveal the 
salient features of this pangenome and its evolutionary history. Over 2,000 protein families 
are detected across multiple species, with a distribution consistent with other studied
pangenomes. Of these, there are 180 protein families with multiple members, 312 families 
with exactly 37 members corresponding to core genes, 428 families with peripheral genes 
with varying taxonomic distribution and finally 1,125 smaller families. The fact that, even 
for smaller genomes of Chlamydiales, core genes represent over a quarter of the average 
protein complement, signifies a certain degree of structural stability, given the wide range 
of phylogenetic relationships within the group. In addition, the propagation of a corpus of
manually curated annotations within the discovered core families reveals key functional 
properties, reflecting a coherent repertoire of cellular capabilities for Chlamydiales. We
further investigate over 2,000 genes without homologs in the pangenome and discover two 
new protein sequence domains. Our results, supported by the genome-based phylogeny for 
this group, are fully consistent with previous analyses and current knowledge, and point to 
future research directions towards a better understanding of the structural and functional 
properties of Chlamydiales. 

Keywords: comparative genomics; pangenome analysis; Chlamydiales; protein family 
detection; genome annotation; genome trees 



1. Introduction

Members of the order Chlamydiales are obligate intracellular bacteria characterized by a unique
developmental cycle. They are important pathogens of humans and animals, causing a wide range of
diseases, including several zoonoses [1-3]. The order Chlamydiales, separated from other eubacteria
by forming a deep branch in ribosomal RNA-based phylogenetic trees, has been enriched by new 
lineages. Beside the family Chlamydiaceae, in which important chlamydial pathogens are grouped, 
new families, such as Parachlamydiaceae, Simkaniaceae and Waddliaceae, have been recognized to 
accommodate newly discovered pathogenic and non-pathogenic chlamydial organisms [4-6]. 

Since the release of the first chlamydial genome sequence from Chlamydia trachomatis (serovar D) [7], 
new genomes are being sequenced, thus offering insights into the genome organization and functional 
capacity of the corresponding species [8]. Besides its crucial importance for applied research in 
medical and veterinary microbiology [9], this corpus of genomic information is also key to 
understanding the evolutionary position of various chlamydial species (or strains) and the inference of 
the internal phylogeny of this distinct taxon [8,10-12]. 

As the intracellular lifestyle imposes constraints on gene content and metabolic capabilities, the 
Chlamydiales might represent one of the best datasets for the development of pangenome analysis 
methods [13]. Additional challenges are the wide variety of chlamydial genome sizes with unequal 
rates of reduction, and a repertoire of less characterized proteins than other bacterial groups whose 
pangenomes have been analyzed, e.g., Streptococcus or Salmonella [14,15].

Previously, we have used the genome of Chlamydia trachomatis [7] as a case study for annotation 
transfer quality [16]. Using a novel encoding scheme and a scoring function called TABS for transitive 
annotation-based scale [16], our main finding regarding annotation was that, despite a number of 
inconsistencies, automated annotation pipelines performed remarkably well when benchmarked 
against a manually curated annotation corpus [16]. These results are important for the quantification of 
reproducibility and consistency in genome-wide annotation [17].

In this work, we explore the entire set of the Chlamydiales pangenome with a broad collection of 
genome sequences publicly available to date (31 Chlamydiaceae and six other Chlamydiales genomes),
twice as many as in a similar recent analysis [18]. Importantly, our pangenome analysis pipeline 
incorporates recently sequenced genomes of key Chlamydiaceae species not previously reported, thus
augmenting our understanding from previous findings [18,19]. 

We focus on key aspects of pangenome analysis and explore multiple facets of the Chlamydiales 
gene content in terms of protein-coding genes and families. We also provide certain key findings that 
might illuminate the evolutionary history of this group as well as interesting sequence motifs not 
widely shared within this order. Beyond the confirmation of the recent analysis of the Chlamydiales as 
mentioned above [18], we also use this group to expand on methods for pangenome analysis [13,20] 
by proposing a pangenome analysis pipeline. Our results are consistent with wider studies of 
pangenomes [21] and provide additional knowledge for Chlamydiales. In conclusion, pangenome 
analysis offers an opportunity for the study of bacterial genome evolution, the development of relevant 
methods and the understanding of genome structure and proteome function on a large scale.

2. Experimental Section

2.1. Data Collection 

All protein sequence data from 37 genomes were compiled into a single data collection (February-July 
2011), including the most recent published Chlamydiales genomes. In total, 43,736 protein-coding 
genes were extracted from public databases corresponding to the entire set of 37 genome sequences from 
the bacterial order Chlamydiales currently available (Table 1). Sequence data were codified following 
the style of the COGENT database [22], for easy identification both by programs and human users 
(Supplement S1). The above notation is followed throughout this work. The COGENT scheme
encodes genus and species names into a four-character identifier prefix string, followed by a code for 
the strain name, its version (in this collection all versions are considered as version 1 and optionally 
hidden) and finally for proteins the relative order of the sequence within the genome [23] (Table 1). 
We have also recorded the date of publication for the corresponding genome (or the release date where 
no publication was available) (Supplement S2). 
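
The identifier layout described above can be illustrated with a short parsing sketch. It assumes that identifiers look like the examples in this work (e.g., CTRA-DUW-01 for a genome and CCAV-GPI-01-000824 for a protein): a four-letter genus/species prefix, a three-character strain code, a two-digit version and, for proteins, a numeric order field. The regular expression and the names `CogentId`, `parse_cogent_id` and `genome_code` are illustrative assumptions, not part of the COGENT database itself.

```python
import re
from typing import NamedTuple, Optional


class CogentId(NamedTuple):
    species: str                  # four-letter genus/species prefix, e.g. "CTRA"
    strain: str                   # three-character strain code, e.g. "DUW"
    version: int                  # genome version (always 1 in this collection)
    protein_index: Optional[int]  # relative order in the genome, None for genome codes


# Assumed layout, inferred from the identifiers shown in this work.
_COGENT_PATTERN = re.compile(r"^([A-Z]{4})-([A-Z0-9]{3})-(\d{2})(?:-(\d+))?$")


def parse_cogent_id(identifier: str) -> CogentId:
    """Split a COGENT-style identifier into its components."""
    match = _COGENT_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a COGENT-style identifier: {identifier!r}")
    species, strain, version, index = match.groups()
    return CogentId(species, strain, int(version),
                    int(index) if index is not None else None)


def genome_code(identifier: str) -> str:
    """Return the genome-level code (prefix-strain-version) of an identifier."""
    parsed = parse_cogent_id(identifier)
    return f"{parsed.species}-{parsed.strain}-{parsed.version:02d}"
```

Under these assumptions, a protein identifier can be mapped back to its genome of origin without any lookup table, which is the property that the later per-genome counts rely on.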



Table 1. List of Chlamydiales genome sequences used in this study. 



#   Species and Strain Name/Codes                  Internal Identifier   Protein-Coding Genes
01  Candidatus Protochlamydia amoebophila UWE25    CPRO-UWE-01           2,031
02  Chlamydia muridarum Nigg                       CMUR-NIG-01           911
03  Chlamydia trachomatis 434/Bu                   CTRA-434-01           874
04  Chlamydia trachomatis A/HAR-13                 CTRA-AHA-01           919
05  Chlamydia trachomatis B/Jali20/OT              CTRA-BJA-01           875
06  Chlamydia trachomatis B/TZ1A828/OT             CTRA-BTZ-01           880
07  Chlamydia trachomatis D/UW-3/CX                CTRA-DUW-01           895
08  Chlamydia trachomatis L2b/UCH-1/proctitis      CTRA-L2B-01           874
09  Chlamydophila abortus S26/3                    CABO-S26-01           932
10  Chlamydophila caviae GPIC                      CCAV-GPI-01           1,005
11  Chlamydophila felis Fe/C-56                    CFEL-FEC-01           1,013
12  Chlamydophila pneumoniae AR39                  CPNE-AR3-01           1,112
13  Chlamydophila pneumoniae CWL029                CPNE-CWL-01           1,052
14  Chlamydophila pneumoniae J138                  CPNE-J13-01           1,069
33  Chlamydia trachomatis G/11074                  CTRA-G74-01           919
34  Chlamydia trachomatis G/11222                  CTRA-G22-01           927
35  Chlamydia trachomatis G/9301                   CTRA-G93-01           921
36  Chlamydia trachomatis G/9768                   CTRA-G97-01           920
37  Chlamydia trachomatis Sweden2                  CTRA-SWE-01           875
    Total                                                                43,736



The first column signifies the inclusion order into the genome collection and does not reflect any 
other relationship. The second column lists the species and strain name, the third column the 
COGENT-style identifier and the last column the number of protein-coding genes. 



2.2. Sequence Comparison 

All protein sequence data were masked using CAST with default parameters (threshold = 40), to 
exclude compositionally biased regions [24]. In total, 6,906 such regions were filtered out, provided 
for further study (Supplement S3). 

The masked sequences were used as queries against the genome corpus, in an all-against-all mode 
with BLAST (blastall, e-value threshold 10"^) [25,26]; in total, more than 40,000 BLAST searches 
were performed and 1,709,325 significant similarities below threshold were obtained (Supplement S4). 

2.3. Clustering and Annotation 

The similarity pairwise list (from Supplement S4) was submitted to MCL sequence clustering [27], 
with default parameters (e.g., inflation value 2.0); clusters were incrementally assigned to an integer 
identifier. Clusters are sorted by their size (number of members in a cluster, Supplement S5); thus, the
largest clusters have smallest-integer identifiers (see Results and Appendix-Table 1). This approach
has also been used successfully elsewhere [28] as a method of choice.
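
The numbering convention for clusters can be sketched as follows. The sketch assumes that the clustering output is available as text with one cluster per line and whitespace-separated member labels; the function names are hypothetical and the clustering step itself is not reproduced here.

```python
from typing import Dict, Iterable, List, Sequence


def parse_cluster_lines(lines: Iterable[str]) -> List[List[str]]:
    """Read clusters from text lines (assumed: one cluster per line,
    members separated by whitespace); blank lines are skipped."""
    return [line.split() for line in lines if line.strip()]


def number_clusters_by_size(clusters: Sequence[Sequence[str]]) -> Dict[int, List[str]]:
    """Assign integer identifiers so that the largest cluster gets 1.

    sorted() is stable, so clusters of equal size keep their input order.
    """
    ordered = sorted(clusters, key=len, reverse=True)
    return {ident: list(members) for ident, members in enumerate(ordered, start=1)}
```

With this convention a small identifier always denotes a large family, so a range of identifiers (for instance clusters 1-180 in the Results) corresponds to a contiguous range of cluster sizes.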

Annotation transfer based on the first chlamydial genome ever sequenced was implemented through 
the direct matching of the lead sequences to a previously highly curated dataset for Chlamydia 
trachomatis D/UW-3/CX [7]. 

The annotation qualifiers used in the manually curated corpus [16] are: ENZYME (for enzymes 
with EC number assignments), FUNCTION (for other protein functions), SIMILAR-TO (for those 
sequences with a similarity to a protein of known function but no specific assignment) and DOMAIN
(for the existence of a known, named protein sequence domain) [16] (Appendix-Table 1). 

Sequence matching of the original dataset to the data collection presented here was performed by 
MagicMatch [29], which was the first scheme to implement the MD5 checksum for protein sequence 
identification, an approach later propagated in all major database resources. 
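
The principle of checksum-based sequence matching can be shown with a minimal sketch. It is not the MagicMatch implementation: the normalization step (removing whitespace and converting to upper case) and the function names are assumptions made for illustration only.

```python
import hashlib
from typing import Dict, List


def sequence_checksum(sequence: str) -> str:
    """MD5 hex digest of a protein sequence.

    Assumed normalization: whitespace removed, letters upper-cased, so that
    line wrapping and letter case do not change the checksum.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def match_by_checksum(reference: Dict[str, str],
                      collection: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each reference identifier to the collection identifiers that
    carry an identical sequence (possibly none, possibly several)."""
    index: Dict[str, List[str]] = {}
    for ident, seq in collection.items():
        index.setdefault(sequence_checksum(seq), []).append(ident)
    return {ref_id: index.get(sequence_checksum(seq), [])
            for ref_id, seq in reference.items()}
```

Only exact sequence identity is detected in this way; a single residue difference between two database versions of the same protein yields unrelated checksums, so such entries have to be matched by similarity search instead.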

2.4. Analysis of Unique Genes 

All unique genes, i.e., more than 2,000 genes with no similarity within the pangenome, were 
searched against the non-redundant protein sequence database (nrdb: 15,052,178 entries) [30]. Results 
from this search were evaluated manually and key similarities were extracted for further investigation 
(Supplement S6). 

2.5. Genome Trees 

Genome-based trees were calculated using phylogenetic profile distance [31,32]. Similarity values 
were measured by the shared number of genes represented by phylogenetic profiles, symmetrified by 
the minimum shared value, normalized by minimum self-similarity and turned into distance values as 
previously described [32,33] (Supplement S7). 
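
One possible reading of this distance measure is sketched below. The precise formulation is given in the cited references [32,33]; the version here is an assumption that follows the wording of the text: a directed count of genes of one genome that fall into families present in the other, the minimum of the two directed counts as the symmetric similarity, normalization by the smaller of the two self-similarities, and conversion to a distance by subtraction from one. Profiles are represented as a hypothetical mapping from family identifier to gene count.

```python
from typing import Dict, Mapping

Profile = Mapping[str, int]  # family identifier -> number of genes in this genome


def directed_shared(a: Profile, b: Profile) -> int:
    """Number of genes of genome a that belong to families also present in b."""
    return sum(count for family, count in a.items()
               if count > 0 and b.get(family, 0) > 0)


def profile_distance(a: Profile, b: Profile) -> float:
    """Distance in [0, 1] between two genomes (assumed formulation, see lead-in)."""
    shared = min(directed_shared(a, b), directed_shared(b, a))      # symmetrify
    self_min = min(directed_shared(a, a), directed_shared(b, b))    # self-similarity
    if self_min == 0:
        raise ValueError("empty profile")
    return 1.0 - shared / self_min


def distance_matrix(profiles: Mapping[str, Profile]) -> Dict[str, Dict[str, float]]:
    """All-against-all distances, keyed by genome code."""
    names = sorted(profiles)
    return {x: {y: profile_distance(profiles[x], profiles[y]) for y in names}
            for x in names}
```

The resulting matrix can be passed to any standard distance-based tree-building method; normalizing by the smaller genome prevents a reduced genome that is fully contained in a larger one from appearing distant merely because of the size difference.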

2.6. Sequence Alignments 

Multiple sequence alignments were performed and visualized by JalView [34]. Novel motifs 
reported in this work are provided below and in FASTA format (Supplements S8, S9).

2.7. Data Availability

Per genome contributions to the pangenome are also provided (Supplement S10). All sequence data
and results (in 10 Supplements) have been made available at datadryad.org, under the identifier [35]. 

3. Results 

3.1. General Characteristics of the Chlamydiales Pangenome 

The Chlamydiales collection herein contains over 40,000 protein-coding genes in total, with ~1,200
genes/genome on average, with significant deviations (Table 1). We take the view to present the two 
extreme tails of this data collection in detail following the clustering step for the identification of 
protein families within the pangenome and comment on the intermediate cases. In other words, we 
primarily focus on the two classes of the most interesting clusters, (i) those containing the core genes
and (ii) those corresponding to "unique" genes, without significant similarities within the pangenome,
thus singleton clusters. The functional characterization of the entire complement as well as further
issues listed in the discussion for future research are clearly beyond the scope of this critical review.

3.2. Protein Families 

In total, the clustering has yielded 5,554 clusters corresponding to protein families. For practical 
purposes, we define a protein family as one that contains at least three genes: in that sense, there are 
294 cases, which do not detect themselves in this comparison (typically because of either short length, 
abnormal composition, or both), 2,038 unique genes (singletons) and 1,177 doublets. The remaining 
2,045 clusters represent protein families with three or more members, distributed across 37 genomes 
(Figure 1).
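
The size classes used in this work can be written down as a small classification sketch. The thresholds for singletons, doublets, core-size and peripheral families are taken from the text; the bin of "smaller" families (three to ten members) is inferred from the counts in the Abstract, and the label names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_cluster(size: int, has_self_hit: bool = True, n_genomes: int = 37) -> str:
    """Assign a cluster to one of the size classes discussed in the text."""
    if size < 1:
        raise ValueError("cluster size must be positive")
    if not has_self_hit:
        return "no_self_hit"      # sequence does not detect itself
    if size == 1:
        return "singleton"        # "unique" gene
    if size == 2:
        return "doublet"
    if size > n_genomes:
        return "multi_member"     # more members than genomes
    if size == n_genomes:
        return "core_size"        # candidate core family
    if size > 10:
        return "peripheral"       # more than ten, fewer than n_genomes
    return "small_family"         # three to ten members (inferred bin)


def tally(clusters: Iterable[Tuple[int, bool]]) -> Counter:
    """Count clusters per class from (size, has_self_hit) pairs."""
    return Counter(classify_cluster(size, self_hit) for size, self_hit in clusters)


# Counts reported in this work; they add up to the 5,554 clusters.
REPORTED = {"no_self_hit": 294, "singleton": 2038, "doublet": 1177,
            "multi_member": 180, "core_size": 312, "peripheral": 428,
            "small_family": 1125}
```

As a consistency check, the four family bins (180 + 312 + 428 + 1,125) sum to the 2,045 families with three or more members, and all seven classes together sum to 5,554.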

Figure 1. Pangenome protein family size distribution. Cluster size is displayed on the 
X-axis (bins until 50 are all shown; above 50, bins are shown for each ten counts, labels for 
every five bin sizes); absolute frequency of clusters is shown on the left y-axis (bars, green 
curve); cumulative count of clusters is shown on the right y-axis (orange curve). Families 
are defined as those clusters with at least three members (see text); all cluster frequencies 
are shown here for completeness. The bimodal nature of the distribution can be seen 
between the peak at low cluster sizes and 37; above 37 there are multi-member and 
multi-species protein families (see text). 

It is evident that the protein family size distribution follows, as expected, the shape of other
pangenome analyses, with a clear bimodal distribution, with one peak at low-count families which has 
been called the "accessory pool" and another peak at the limit of the genomes under consideration, which 
has been called the "extended core" [21]. The so-called "character genes" (which we prefer to define 
as "peripheral", as opposed to "core" genes) exhibit, by definition, a heterogeneous distribution across 
genomes (and between peaks) and present an additional challenge for further interpretation (Figure 1). 

The peak at exactly 37 with 312 counts, i.e., 312 families with exactly 37 members, corresponds to
the number of 37 genomes analyzed across the pangenome. Beyond that peak, there are 180 protein 
families with more than 37 members (clusters 1-180) (Supplement S5), of which ten contain more 
than 100 members and are discussed below. 

3.3. Multi-Member Families 

The four largest families with more than 120 members are represented by the ABC transporter 
permeases (530 members), the polymorphic outer membrane proteins of Chlamydiaceae [36] (POMPs, 
435 members), the flagellum-specific ATP synthases/type III secretion system ATPases, e.g., CT669 [37] 
(152 members), and a family of unknown function recently characterized as type III secreted effectors [38]
(DUF582, 140 members) (Figure 2). 

Figure 2. Top ten multi-member families within the pangenome. Genomes (with full 
COGENT-like codes) are shown on the x-axis, sorted by total protein-coding gene count 
(see also Table 1). Absolute cumulative counts of multi-member families are shown on the 
y-axis (displayed in the figure legend from left to right and then top to bottom, e.g., ABC 
transporter permeases, POMPs, type III secretion system ATPases, etc. according to size, 
see text), color coded according to figure legend. 


Following those, there are another four families with more than 110 members each: the EF-Tu/EF- 
G/LepA family (119 members), the oligopeptide binding protein family OppA (114 members), the 
GroEL family (111 members) and finally the Ile-Leu-Val (ILV)-tRNA synthetases (111 members). 
These are followed by two families with more than 100 members, namely the dihydrolipoamide
acetyltransferase E2 component/dihydrolipoamide succinyltransferase (110 members) and the 3-oxoacyl-
[acyl-carrier protein] reductase families (109 members) (Figure 2).

A significant number of multi-member families contain proteins of known function (Supplement S5).
Interestingly, families containing only homologues from S. negevensis, W. chondrophila,
P. acanthamoebae and Protochlamydia amoebophila are 172 in total, remarkably close to the
171 clusters of "orthologous" proteins in this group of species reported recently [18]. 

3.4. Core Genes 

At the other end of the bimodal distribution, there are 312 families with 37 genes each, reflecting 
the number of genomes analyzed. However, there are eight clusters here with duplicates per genome 
(clusters 224, 460: S. negevensis; 254, 276, 420: P. acanthamoebae; 255, 272: P. amoebophila; 
429: W. chondrophila 203) (two of which have unknown functional roles, Appendix-Table 1). Thus,
there are exactly 304 protein families with 37 genes each represented once in each genome, which can 
be truly called "core" genes, most of which have some source of annotation (Appendix-Table 1). 
These represent just over a quarter of the average chlamydial genome (304/1,182 ≈ 26%).
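
The distinction between a family with 37 members and a strictly defined core family can be made explicit with a short sketch. It assumes COGENT-style protein identifiers, whose first three dash-separated fields name the genome; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def genome_of(protein_id: str) -> str:
    """Genome code of a COGENT-style protein identifier (assumed layout:
    the first three dash-separated fields)."""
    return "-".join(protein_id.split("-")[:3])


def is_strict_core(members: Iterable[str], genomes: Set[str]) -> bool:
    """True if every genome contributes exactly one member to the family."""
    per_genome = Counter(genome_of(member) for member in members)
    return set(per_genome) == set(genomes) and all(n == 1 for n in per_genome.values())


def core_fraction(n_core: int, total_genes: int, n_genomes: int) -> float:
    """Core families as a fraction of the average protein complement."""
    return n_core / (total_genes / n_genomes)
```

A family with as many members as genomes fails this test when one genome contributes two members and another none, which is the case for the eight clusters listed above (312 - 8 = 304). With the totals of Table 1, `core_fraction(304, 43736, 37)` evaluates to about 0.257, i.e., the 26% quoted in the text.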

Annotations transferred from the manually curated seed annotation corpus of C. trachomatis reveal 
a wide range of functional roles for this core set, as expected (Appendix-Table 1). Indeed, 227 families 
of the core set can be assigned to a functional role, according to the annotation qualifiers originally
used (see Experimental Section). Only an additional 77 cases in this set do not contain any annotation 
(Appendix-Table 1). It can be argued, therefore, that this level of characterization of 75% (227/304) 
across 37 genomes signifies a functional coherence that is consistent with our current knowledge of 
this taxonomic order. This list is provided for further investigation by the community; it is worth 
pointing out that it encompasses basic cellular roles in genetic information processing (e.g., cluster 184), 
including transcription (e.g., cluster 187) and translation (e.g., clusters 242-243), metabolic 
transformations (e.g., cluster 182 or 196), transport systems (e.g., clusters 193-195) and other key 
processes (e.g., cluster 192). It is interesting to note that apart from complements represented by 
ribosomal proteins or aminoacyl-tRNA synthetases, other systems are also coherently detected, for 
example the NifU [39]/NifS [40] genes (clusters 221-222). 

3.5. Peripheral Genes 

In the midst of the two extremes (viz. peaks) of the bimodal family size distribution, there exists a 
wide variety of cases with an anomalous and clearly heterogeneous pattern. There are 428 families 
with more than ten and fewer than 37 members (not shown, available in Supplement S5). Their
heterogeneous composition is reflected by the fact that 217 of the 428 families (just over 50%) do not
contain a homolog outside the Chlamydiaceae, i.e., across the larger genomes mentioned above.
Within this group, however, there is a significant variation of family phylogenetic distribution (not
shown) that needs to be explored in future research.

3.6. Unique Genes 

In total, there are 2,038 unique genes represented by singleton clusters, thus not falling into families 
within the pangenome. The content of genomes with unique genes varies significantly, from 0 to 796 
(S. negevensis), with 55 unique genes on average. In percentage terms, this varies from 0 to
32% of the genome (S. negevensis), with an average of just over 3% per genome (Figure 3).

Figure 3. Correlation between genome size and unique genes. Genome size is given as the
number of protein-coding genes (shown on the x-axis) against the count of unique genes 
(number of unique genes without homologs within the pangenome, shown on the y-axis; 
y-axis is displayed on logarithmic scale). The six points on the upper right part of the graph 
are evidently those genomes with largest gene counts, all outside the Chlamydiaceae 
family (see Table 1 and text). The pattern observed is primarily due to the sampling of 
taxonomic space of the Chlamydiales and will vary as more genomes from this group 
become available. 

The densest part of the phylogeny exhibits no unique genes: 17 genomes, including most of the
C. trachomatis and C. psittaci strains, C. pneumoniae CWL029 and C. abortus LLG (Figure 3, 
missing points corresponding to 17 genomes with zero value on the y-coordinate, available in 
Supplement S5). Twenty genomes have unique genes, of which six genomes have less than 10 such 
genes and one with 15 unique genes (Figure 3), all from the above group, or less than 2% of their 
genome entries. Another five genomes with a handful of unique genes are C. pneumoniae AR39 
(33/3%), TW-183 (43/4%) and LPCoLN (60/5%) as well as C. felis (27/3%) and C. caviae (29/3%). 
The remaining eight genomes contain the majority of unique genes, 1,818 in number or 89% of total,
ranging from 66 (W. chondrophila WSU, 3% of genome) to 796 genes (S. negevensis, 32% of
genome). This is not entirely a biological effect, rather a sampling artifact arising from the deeper 
sequencing of the C. trachomatis /C. pneumoniae group (see below). 

The six outliers which form a different group above (upper right, Figure 3) are all species with large
genomes (ca. 2,000 protein-coding genes or more): the two W. chondrophila strains (3-4%), the two P.
acanthamoebae strains (4-8%), P. amoebophila (20%), and S. negevensis (32%), listed here according
to the absolute number of their unique genes per genome. In relative terms, however, two species 
namely C. muridarum (36/4%), and C. pecorum (76/8%) contain a significant number of unique genes 
given their relatively small genome size (both less than 1,000 protein-coding genes). 

3.7. Properties of Unique Genes

The genes considered as singletons in this analysis are 2,038 as mentioned above. Of those, a 
number of short genes might fall into pangenome families (not shown) but do not seriously affect the 
overall assessment (e.g., case CCAV-GPI-01-000824 in Supplement S6). This is an artifact of
sensitivity for the two different searches, first against the 40,000 or so genes of the pangenome and 
second against the entire nrdb database of more than 15 million sequences. While a full analysis of the 
unique gene complement of the Chlamydiales is in progress, it is interesting to report on a number
of findings pertinent to this work. 

A number of genes from the pangenome have identified homologs such as cell-wall associated 
hydrolases (TC0114 from C. muridarum Nigg), proteins of unknown function (e.g., pc0061, pc0549, 
pc0850, pc0855), endonucleases (e.g., pc0252), exonucleases (pc0951), transposases (e.g., pc0068), 
DNA repair proteins (e.g., pc0286), acyltransferases (e.g., pc0180), Mg chelatases (pc0480), 
oxidoreductases (pc0504), streptomycin 6-kinases (e.g., pc0510), metallophosphoesterases (pc0948) 
from P. amoebophila and LmbE/ypjG family proteins (e.g., wcw_0275) or transposases (e.g., wcw_0482) 
from W. chondrophila WSU. Similarly, multiple cases of similarity to families of known or unknown 
function are discovered for unique genes from the larger genomes (not shown). 

One novel domain is an enigmatic, short and highly conserved motif containing the triplet Pro-Cys-Tyr
(PCY), present in the C. pneumoniae AR39 CP0988 protein. This protein is 52 residues long and does 
not exhibit significant similarities to any other protein in the Chlamydiales pangenome. However, it 
does show similarity to a set of short proteins (<100 residues long) from various species, including 
Acinetobacter, Brucella, Clostridium, Coxiella, Curvibacter, Eubacterium, Parvimonas, Rhizobium, 
Ruminococcus, Selenomonas, Streptomyces, other longer proteins from Chloroflexi, Heliobacterium,
Lactobacillus, the C-terminus of a Propionibacterium protein (HL046PA2) and an uncultured 
Acidobacteria bacterium HF4000_26D02, and importantly, to a number of longer plant proteins from 
Nicotiana tabacum, Pinus koraiensis, Solanum demissum (middle of protein) and Vitis vinifera 
(N-terminus, total length 1,193 residues) (Supplement S8). This conserved region with this peculiar
phylogenetic distribution has not been characterized previously to our knowledge, and can be
considered a genuine novel domain of unknown function (Figure 4). It remains unclear whether the
domain has been universally lost from the Chlamydiales pangenome or acquired by C. pneumoniae
through horizontal transfer.

Another interesting example of a unique protein is the P. amoebophila pc0506. This 82-residue-long 
uncharacterized protein is evidently absent from the core pangenome and yet it exhibits significant 
similarity to four Verrucomicrobia proteins from Verrucomicrobium spinosum, Chthoniobacter flavus, 
Pedosphaera parvula and Coraliomargarita akajimensis, in this order of similarity, ranging from 53% 
down to 44% sequence identity (Figure 5). The above mentioned proteins reportedly belong to the leucyl
aminopeptidase superfamily (Supplement S9). The functional significance of this biochemical role for
P. amoebophila is not yet understood. Yet, the strong mutual similarity of this protein family with 
Verrucomicrobial and P. amoebophila members (no other member in the entire pangenome) can be 
placed within the general controversy of the connection of Chlamydiales with the so-called PVC 
group [41,42] (see below). 

Figure 4. Alignment of the PCY domain. The PCY motif is centered around position 15 of
the multiple alignment. The domain was discovered following five iterations with PSI-BLAST
with CP0988 as query sequence (GI:16752158), until convergence and an e-value
threshold 0.005. In total 70 sequences were recovered; redundancy was removed at 95%
with Jalview [34], resulting in 32 sequences shown here. The length of the domain is just
30 residues; boxes signify sequence identity at 50% or above (darker color: more 
conserved). GI labels are provided, along with sequence coordinates on the left of the 
alignment (see text for more details and discussion). 

Figure 5. Alignment of a unique leucyl aminopeptidase family. The domain was
discovered following five iterations with PSI-BLAST with pc0506 as query sequence 
(YP_007505.1). Display conventions as in Figure 4. 

In all, it appears that properties encoded by most unique genes, apart from their unusual
phylogenetic distribution, represent accessory functional roles that provide additional versatility to the
largest genomes in the group, possibly related to their extra functional capabilities. Two exceptions 
with seemingly central functions are wcw_0805, with similarity to the 50S L34 ribosomal protein 
family and wcw_861, with similarity to 6-pyruvoyl tetrahydrobiopterin synthases, both from 
W. chondrophila WSU (not shown). 

3.8. Protein Family Contributions from Genome Projects

As mentioned above, we have tracked the original publication (and/or release) dates for the genomes
under consideration, in terms of novel families detected per genome sequence (Supplement S10). By
mapping the protein families which appear first in this ranking order, we can thus estimate the relative 
"novelty" or contribution of previously unseen protein families within the chlamydial pangenome and 
the typical "pangenome saturation curve" (Figure 6). 
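
The construction of such a saturation curve can be sketched in a few lines. The sketch assumes that the genomes are already sorted by publication (or release) date and that the set of protein families present in each genome is known; the function name and data layout are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def saturation_curve(genomes_by_date: Iterable[str],
                     families: Dict[str, Set[str]]) -> List[Tuple[str, int, int]]:
    """For each genome, in date order, report the number of families not seen
    in any earlier genome and the cumulative number of families so far."""
    seen: Set[str] = set()
    curve: List[Tuple[str, int, int]] = []
    for genome in genomes_by_date:
        novel = families[genome] - seen   # families first observed here
        seen |= novel
        curve.append((genome, len(novel), len(seen)))
    return curve
```

The per-genome counts depend on the ordering: a family is credited to the first genome in which it appears, so a strain sequenced after a close relative contributes few or no novel families regardless of its gene content.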

Figure 6. Protein family contributions from genome projects. Genome codes are sorted 
according to their original publication date (and/or release date, x-axis); absolute number 
of "novel" protein families within the pangenome are given (left y-axis, blue curve and 
square symbols); cumulative sum of protein families (up to 5,260, excluding those without 
self-hits, see text) is also shown, defined as a "pangenome saturation curve" (right y-axis, 
green curve and square symbols). 

As expected, and discussed above (Figure 3), for the densest part of the group, little or no
contributions have been provided. Apart from the larger genomes, which have added hundreds of new
gene types [19], the more distant members of the group with small genomes, for instance C. caviae or
C. pecorum, have also contributed a significant number (80 and 76, respectively; Supplement S10).

3.9. Genome Phylogeny 

Finally, we have reconstructed the genome phylogeny of the pangenome based on the sharing of 
phylogenetic profile patterns based on the above analysis (see Experimental Section). Evidently, the 
pangenome is stratified according to the known, established phylogeny patterns [10] (Figure 7). The 
genome tree is another concise way to visualize the "novelty" components of the various species and 
strains that have been sequenced, exemplified above in various contexts, e.g., number of unique genes 
(Figure 3) or the tracking of the relative contributions of novel protein families (Figure 6). A future 
aspect of this work will be to infer the history of the pangenome using methods of ancestral state 
reconstruction [43]. The evolutionary history of the Chlamydiales as reflected by the genome tree
might also shed light on the ongoing controversy about their status within the tree of life [41]. 

Figure 7. Genome tree of the Chlamydiales. Dendrogram representing phylogenetic 
relationships of the 37 Chlamydiales genomes analyzed, based on sharing of phylogenetic 
profiles (see Experimental Section for details). Genome codes are given as labels. 

The genome tree accurately reflects the current taxonomy of Chlamydiales [4,44], with a couple of
notable exceptions, namely the clustering of C. abortus with C. psittaci, the closer relationship of
C. felis with the former two species against C. caviae (in agreement with previous findings [6,44] but
not with other proposals [8]), as well as the distinct relationship of C. pecorum at the root of
Chlamydiaceae and not as a sister group of C. pneumoniae [4,44]. The resulting phylogenetic tree using
genome-wide phylogenetic profile sharing patterns can also act as an internal control of the 
pangenome analysis, since all the closely related strains sequenced are grouped together with very high 
accuracy (Figure 7). 

4. Discussion 

Our results suggest that the Chlamydiales pangenome reflects a certain degree of structural stability,
as core genes represent over a quarter of an average genome, as well as functional coherence, in the 
sense that most functional properties of these genes are consistent with current knowledge. Unlike 
various claims in the recent literature, it turns out that, at least in the case of a highly constrained 
pangenome of intracellular pathogens, there is an unexpected degree of stability, given the wide range 
of phylogenetic relationships within this particular taxon. 

It is thus shown that for the smallest of genomes (<900 protein-coding genes), over a third of their 
gene content is shared with larger genomes (>2,000 genes), decorated by a broader element of so-called 
"character", or peripheral, genes. This distribution, which in turn is influenced by the sampling of 
phylogeny and other factors, requires further investigation, being beyond the scope of this work. 

It should also be pointed out that the Chlamydiales pangenome exhibits general characteristics of 
distribution not dissimilar to other recent pangenome analyses, including those of the Salmonella 
pangenome with 45 strains [15], the Streptococcus pneumoniae pangenome with 44 strains [14] and 
the Campylobacter pangenome with 96 strains [28], suggesting the conservation of a core pangenome 
within and across bacterial taxa that have been sampled adequately. In the case of Salmonella, tracking 
the contributions of new strains to the entire core set and the pangenome suggests a slight expansion 
with more sampling and a stable core, reminiscent of the Chlamydiales, with one third of the 
pangenome represented in the core set [15]. A slightly less stable pattern is detected in the 
Streptococcus pneumoniae group [14], possibly due to a wider diversity in that sample, yet with a 
similar pattern of core set saturation. Interestingly, an attempt at ancestral reconstruction in the 
S. pneumoniae/S. mitis complex suggests that there is a dual process of genome expansion and 
reduction in the different paths leading to the genomes of contemporary strains [14]. A more 
comprehensive analysis of the Campylobacter pangenome with 96 strains [28], using a combination of 
experimental and theoretical work, also points in the same direction: within the two species groups 
examined, the core gene set overlap reaches 80%, supporting earlier findings for the related 
Helicobacter pylori strains [45]. 

5. Conclusions 

We have thus examined the salient features of the Chlamydiales pangenome, introducing a 
pangenome analysis pipeline and certain definitions that facilitate the discovery of core and peripheral 
genes, the identification of unique genes with various origins as well as the detection of novel protein 
sequence domains. We expect that analogous efforts will lead to rigorous standards for pangenome 
analysis in the future. Future research opportunities abound, for example: ancestral reconstruction [43], 
syntenic patterns of genome structure (e.g., [28,45]), the (presently limited) enrichment with 
expression data, the evolutionary histories of 'peripheral' genes (as discussed above), the connection 
of Chlamydiales with plants [46-50], the position of the Chlamydiales in the tree of life, and the 
connection with the PVC superphylum [41,42,50]. Wider challenges that go beyond the above 
pangenome-specific issues might include a more detailed annotation of the entire dynamic range of 
family distribution [21], the characterization of protein function in a wider context including 
comparative metabolic reconstructions [19], the evolution of mobile elements [51], the deeper 
understanding of the physiological and pathological properties [52,53] of the strains that have been 
sequenced and the connection with other pangenomes [28]. 
Appendix-Table 1. Core gene and protein families in the Chlamydiales. 



Cluster ID  Lead Sequence       Master Sequence     Function Annotation from Master Sequence
181         CABO-LLG-01-000000  CTRA-DUW-01-000647  NA
182         CABO-LLG-01-000003  CTRA-DUW-01-000644  ENZYME UDP-N-acetylglucosamine pyrophosphorylase [EC] 2.7.7.23
183         CABO-LLG-01-000004  CTRA-DLC-01-000248  FUNCTION PhoB-like protein
184         CABO-LLG-01-000015  CTRA-DLC-01-000228  FUNCTION RecA protein
185         PACA-UV7-01-001616  CTRA-DUW-01-000487  NA
186         CABO-LLG-01-000023  CTRA-DUW-01-000658  NA
187         CABO-LLG-01-000024  CTRA-DUW-01-000624  FUNCTION RNA Polymerase Sigma-54 factor RpoN
188         CABO-LLG-01-000026  CTRA-DUW-01-000622  ENZYME Uracil DNA glycosylase [EC] 3.2.2.-
189         CABO-LLG-01-000028  CTRA-DUW-01-000620  SIMILAR-TO NTPase HAM1 homolog [EC] 3.6.1.15
190         CABO-LLG-01-000029  CTRA-DLC-01-000273  NA
191         CABO-LLG-01-000032  CTRA-DUW-01-000616  NA
192         CABO-LLG-01-000034  CTRA-DLC-01-000278  FUNCTION Peptidoglycan-associated lipoprotein
193         CABO-LLG-01-000035  CTRA-DLC-01-000279  FUNCTION TolB macromolecule uptake homolog
194         CABO-LLG-01-000037  CTRA-DUW-01-000611  FUNCTION TolR/ExbD macromolecule uptake homolog
195         CABO-LLG-01-000040  CTRA-DLC-01-000284  FUNCTION protein translocase TatD/MttC homolog
196         CABO-LLG-01-000047  CTRA-DUW-01-000600  ENZYME enolase [EC] 4.2.1.11
197         CABO-LLG-01-000048  CTRA-DUW-01-000599  FUNCTION Excinuclease ABC subunit B
198         CABO-LLG-01-000049  CTRA-DUW-01-000598  ENZYME Tryptophanyl-tRNA Synthetase [EC] 6.1.1.2
199         CTRA-G22-01-000161  CTRA-DUW-01-000746  ENZYME Seryl-tRNA Synthetase [EC] 6.1.1.11
200         CABO-LLG-01-000054  CTRA-DUW-01-000593  FUNCTION Nickel transporter Cnrl homolog
201         CABO-LLG-01-000061  CTRA-DUW-01-000586  NA
202         CABO-LLG-01-000062  CTRA-DUW-01-000585  FUNCTION type II secretion system protein D homolog
203         CABO-LLG-01-000063  CTRA-DUW-01-000584  FUNCTION type II secretion system protein E homolog
204         CABO-LLG-01-000064  CTRA-DLC-01-000308  FUNCTION type II secretion system protein F homolog
205         CABO-LLG-01-000065  CTRA-DLC-01-000309  NA
206         CABO-LLG-01-000070  CTRA-DUW-01-000577  FUNCTION protein secretion system YscT homolog
207         CABO-LLG-01-000072  CTRA-DLC-01-000316  FUNCTION protein secretion system YscR homolog
208         CABO-LLG-01-000073  CTRA-DLC-01-000317  FUNCTION protein secretion system YscL homolog
209         CABO-LLG-01-000074  CTRA-DUW-01-000573  NA
210         CABO-LLG-01-000076  CTRA-DUW-01-000571  ENZYME lipoate synthase [EC] 2.8.1.-
211         CABO-LLG-01-000081  CTRA-DUW-01-000714  ENZYME Endonuclease III [EC] 4.2.99.18
212         CABO-LLG-01-000083  CTRA-DUW-01-000716  ENZYME Phosphatidylserine decarboxylase [EC] 4.1.1.65
213         CABO-LLG-01-000085  CTRA-DUW-01-000718  FUNCTION preprotein translocase subunit SecA
214         CABO-LLG-01-000089  CTRA-DUW-01-000722  ENZYME ATP-dependent Clp protease ATP-binding subunit ClpX [EC]
215         CABO-LLG-01-000091  CTRA-DUW-01-000724  FUNCTION Trigger factor
216         CABO-LLG-01-000093  CTRA-DUW-01-000726  FUNCTION Rod shape-determining protein MreB
217         CABO-LLG-01-000094  CTRA-DUW-01-000727  ENZYME Phosphoenolpyruvate carboxykinase (GTP) [EC] 4.1.1.32
218         CABO-LLG-01-000098  CTRA-DUW-01-000731  ENZYME Glycerol-3-phosphate dehydrogenase [NAD+] [EC] 1.1.1.8
219         CABO-LLG-01-000099  CTRA-DUW-01-000732  ENZYME UDP-N-acetylhexosamine pyrophosphorylase [EC] 2.7.7.-
220         CCAV-GPI-01-000128  CTRA-DUW-01-000503  FUNCTION Transcription termination factor Rho
221         CABO-LLG-01-000104  CTRA-DUW-01-000737  DOMAIN NifU
222         PACA-HAL-01-002518  CTRA-DUW-01-000261  ENZYME NifS aminotransferase [EC]
223         CABO-LLG-01-000109  CTRA-DUW-01-000742  ENZYME Biotin-[acetyl-CoA-carboxylase] synthetase [EC] 6.3.4.15
224 *       CABO-LLG-01-000121  CTRA-DUW-01-000754  DOMAIN SET
225         CABO-LLG-01-000122  CTRA-DUW-01-000755  SIMILAR-TO metallo-beta-lactamase [EC] 3.5.-.-
226         CABO-LLG-01-000123  CTRA-DUW-01-000756  FUNCTION Cell division protein FtsK C-terminus
227         CABO-LLG-01-000125  CTRA-DUW-01-000757  NA
228         CABO-LLG-01-000126  CTRA-DUW-01-000758  FUNCTION preprotein translocase complex subunit YajC
229         CABO-LLG-01-000130  CTRA-DUW-01-000762  ENZYME Protoporphyrinogen oxidase HemY [EC] 1.3.3.4
230         CABO-LLG-01-000132  CTRA-DUW-01-000764  ENZYME Uroporphyrinogen decarboxylase HemE [EC] 4.1.1.37
231         CABO-LLG-01-000134  CTRA-DLC-01-000129  ENZYME Alanyl-tRNA Synthetase [EC] 6.1.1.7
232         CABO-LLG-01-000135  CTRA-DUW-01-000767  ENZYME Transketolase [EC] 2.2.1.1
233         CABO-LLG-01-000136  CTRA-DUW-01-000768  SIMILAR-TO AMP nucleosidase [EC] 3.2.2.4
234         CABO-LLG-01-000142  CTRA-DUW-01-000774  ENZYME Phospho-N-acetylmuramoyl-pentapeptide-transferase [EC]
235         CABO-LLG-01-000143  CTRA-DUW-01-000775  ENZYME UDP-N-acetylmuramoylalanine-D-glutamate ligase [EC]
236         CABO-LLG-01-000144  CTRA-DUW-01-000776  SIMILAR-TO N-acetylmuramoyl-L-alanine amidase C-terminus [EC]
237         CABO-LLG-01-000146  CTRA-DLC-01-000117  ENZYME UDP-N-acetylglucosamine-N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase
238         [illegible]         CTRA-DUW-01-00012?  ENZYME Biotin carboxylase [EC] 6.3.4.14
239         CABO-LLG-01-000150  CTRA-DUW-01-000781  NA
240         CABO-LLG-01-00015?  CTRA-DUW-01-0007??  NA
241         CABO-LLG-01-000157  CTRA-DUW-01-000788  ENZYME bis(5'-nucleosyl)-tetraphosphatase [EC] 3.6.1.17
242         CABO-LLG-01-000168  CTRA-DLC-01-000098  ENZYME Cysteinyl-tRNA Synthetase [EC] 6.1.1.16
243         CABO-LLG-01-000173  CTRA-DUW-01-000804  FUNCTION Ribosomal protein S14
244         CABO-LLG-01-000174  CTRA-DUW-01-000805  NA
245         CABO-LLG-01-000176  CTRA-DUW-01-000808  ENZYME Excinuclease ABC subunit C [EC]
246         CABO-LLG-01-000177  CTRA-DUW-01-000809  FUNCTION DNA mismatch repair protein MutS
247         CABO-LLG-01-000184  CTRA-DUW-01-000815  ENZYME CDP-diacylglycerol-glycerol-3-phosphate
248         CABO-LLG-01-000185  CTRA-DUW-01-000816  ENZYME Glycogen synthase [EC] 2.4.1.21
249         CABO-LLG-01-000186  CTRA-DUW-01-000817  FUNCTION Ribosomal protein L25
250         CABO-LLG-01-000187  CTRA-DUW-01-000818  ENZYME Peptidyl-tRNA hydrolase [EC] 3.1.1.29
251         CABO-LLG-01-000188  CTRA-DUW-01-000819  FUNCTION Ribosomal protein S6
252         CABO-LLG-01-000189  CTRA-DUW-01-000820  FUNCTION Ribosomal protein S18
253         CABO-LLG-01-000190  CTRA-DUW-01-000821  FUNCTION Ribosomal protein
254         CABO-LLG-01-000193  CTRA-DUW-01-000823  NA
255 *       CABO-LLG-01-000194  CTRA-DUW-01-000824  SIMILAR-TO Small-peptide endopeptidase [EC] 3.4.24.55
256         CABO-LLG-01-000195  CTRA-DLC-01-000073  ENZYME Glycerol-3-phosphate acyltransferase [EC] 2.3.1.15
257         CABO-LLG-01-000196  CTRA-DLC-01-000072  ENZYME Ribonuclease E [EC] 3.1.4.-
258         CABO-LLG-01-000197  CTRA-DLC-01-000071  NA
259         CABO-LLG-01-000214  CTRA-DLC-01-000063  ENZYME Glucosamine-fructose-6-phosphate aminotransferase [EC]
260         CABO-LLG-01-000218  CTRA-DUW-01-000840  ENZYME Succinyl-CoA synthetase beta chain [EC] 6.2.1.5
261         CABO-LLG-01-000222  CTRA-DUW-01-000843  SIMILAR-TO Small-peptide endopeptidase [EC] 3.4.24.55
262         CABO-LLG-01-000224  CTRA-DUW-01-000845  ENZYME CDP-diacylglycerol-serine O-phosphatidyltransferase [EC]
263         CABO-LLG-01-000229  CTRA-DUW-01-000850  ENZYME UDP-N-acetylenolpyruvoylglucosamine reductase [EC]
264         CABO-LLG-01-000230  CTRA-DLC-01-000047  FUNCTION Transcription termination protein NusB
265         CABO-LLG-01-000231  CTRA-DLC-01-000046  NA
266         CABO-LLG-01-000233  CTRA-DUW-01-000854  FUNCTION Ribosomal protein L20
267         CABO-LLG-01-000234  CTRA-DUW-01-000855  ENZYME Phenylalanyl-tRNA Synthetase alpha chain [EC] 6.1.1.20
268         CABO-LLG-01-000236  CTRA-DUW-01-000857  NA
269         CABO-LLG-01-000237  CTRA-DUW-01-000858  NA
270         CABO-LLG-01-000240  CTRA-DUW-01-000861  ENZYME Polynucleotide phosphorylase [EC] 2.7.7.8
271         CABO-LLG-01-000241  CTRA-DUW-01-000862  NA
272 *       CABO-LLG-01-000254  CTRA-DUW-01-000874  FUNCTION ABC transporter, ATP-binding protein N-terminus
273         CABO-LLG-01-000267  CTRA-DUW-01-000385  ENZYME Glucose-6-phosphate isomerase [EC] 5.3.1.9
274         CABO-LLG-01-000269  CTRA-DLC-01-000502  ENZYME Malate dehydrogenase [EC] 1.1.1.82
275         CABO-LLG-01-000271  CTRA-DUW-01-000382  SIMILAR-TO D-Amino Acid Dehydrogenase [EC] 1.-.-.-
276 *       CABO-LLG-01-000276  CTRA-DLC-01-000508  ENZYME 3-dehydroquinate dehydratase [EC] 4.2.1.10
277         CPRO-UWE-01-000881  CTRA-DUW-01-000373  ENZYME 3-phosphoshikimate 1-carboxyvinyltransferase [EC]
278         CABO-LLG-01-000277  CTRA-DUW-01-000376  ENZYME 3-dehydroquinate synthase [EC] 4.6.1.3
279         CABO-LLG-01-000278  CTRA-DUW-01-000375  ENZYME Chorismate synthase [EC] 4.6.1.4
280         CABO-LLG-01-000288  CTRA-DUW-01-000371  ENZYME Dihydrodipicolinate reductase [EC] 1.3.1.26
281         CABO-LLG-01-000290  [illegible]         ENZYME Aspartokinase [EC] 2.7.2.4
282         CABO-LLG-01-000298  CTRA-DUW-01-000328  NA
283         SNEG-ZXX-01-000625  CTRA-DLC-01-000783  FUNCTION Translation initiation factor IF-2
284         CABO-LLG-01-000304  CTRA-DUW-01-000322  FUNCTION Ribosomal protein L11
285         CABO-LLG-01-000305  CTRA-DUW-01-000321  FUNCTION Ribosomal protein L1
286         CABO-LLG-01-000306  CTRA-DUW-01-000320  FUNCTION Ribosomal protein L10
287         CABO-LLG-01-000308  CTRA-DUW-01-000318  ENZYME DNA-directed RNA polymerase beta subunit [EC] 2.7.7.6
288         CABO-LLG-01-000309  CTRA-DUW-01-000317  ENZYME DNA-directed RNA polymerase beta prime subunit [EC]
289         CABO-LLG-01-000312  CTRA-DUW-01-000314  NA
290         CABO-LLG-01-000313  CTRA-DLC-01-000569  ENZYME vacuolar ATPase proteolipid subunit E [EC] 3.6.1.34
291         CABO-LLG-01-000314  CTRA-DUW-01-000312  NA
292         CABO-LLG-01-000317  CTRA-DUW-01-000309  ENZYME vacuolar ATPase proteolipid subunit D [EC] 3.6.1.34
293         CABO-LLG-01-000320  CTRA-DLC-01-000?7?  NA
294         CABO-LLG-01-000324  CTRA-DUW-01-000??7  ENZYME Pyruvate Kinase [EC] 2.7.1.40
295         CPRO-UWE-01-001??2  CTRA-DUW-01-000012  NA
296         CABO-LLG-01-000328  CTRA-DUW-01-000013  ENZYME Cytochrome Oxidase D subunit I [EC] [illegible]
297         CABO-LLG-01-000329  CTRA-DUW-01-000014  ENZYME Cytochrome Oxidase D subunit II [EC] 1.10.3.-
298         CABO-LLG-01-000331  CTRA-DLC-01-000860  NA
299         CABO-LLG-01-000332  CTRA-DLC-01-000861  NA
300         CABO-LLG-01-000333  CTRA-DUW-01-00001?  FUNCTION PhoH-like protein
301         CABO-LLG-01-000337  [illegible]         NA
302         CABO-LLG-01-000338  CTRA-DUW-01-000022  FUNCTION Ribosomal protein [illegible]
303         CABO-LLG-01-000342  CTRA-DUW-01-000026  FUNCTION Ribosomal protein S16
304         CABO-LLG-01-000343  CTRA-DUW-01-000027  ENZYME tRNA (guanine N-1) methyltransferase [EC] 2.1.1.31
305         CABO-LLG-01-000344  CTRA-DUW-01-000028  FUNCTION Ribosomal protein L19
306         CABO-LLG-01-000345  CTRA-DUW-01-000029  ENZYME Ribonuclease [illegible] [EC] 3.1.26.4
307         CABO-LLG-01-000346  CTRA-DUW-01-000030  ENZYME Guanylate kinase [EC] 2.7.4.8
308         CABO-LLG-01-000358  CTRA-DUW-01-000215  ENZYME Ribose 5-phosphate isomerase A [EC] 5.3.1.6
309         [illegible]         [illegible]         NA
310         [illegible]         CTRA-DUW-01-000213  NA
311         CABO-LLG-01-0003??  CTRA-DLC-01-0007??  NA
312         CABO-LLG-01-000374  CTRA-DUW-01-000147  ENZYME DNA ligase (NAD+) [EC] 6.5.1.2
313         CABO-LLG-01-000379  CTRA-DUW-01-000210  ENZYME 3-deoxy-d-manno-octulosonic-acid transferase [EC] 2.-.-.-
314         CABO-LLG-01-000392  CTRA-DUW-01-0001??  NA
315         CABO-LLG-01-0003??  CTRA-DUW-01-0001??  ENZYME CTP synthetase [EC] 6.3.4.2
316         CABO-LLG-01-000404  CTRA-DUW-01-000195  ENZYME Queuine tRNA-ribosyltransferase [EC] 2.4.2.29
317         CABO-LLG-01-000420  CTRA-DUW-01-0001?2  NA
318         CABO-LLG-01-000423  CTRA-DUW-01-000199  ENZYME O-sialoglycoprotein endopeptidase [EC] 3.4.24.57
319         [illegible]         CTRA-DUW-01-000187  NA
320         CABO-LLG-01-000426  CTRA-DUW-01-000188  ENZYME Glucose-6-phosphate dehydrogenase [EC]
321         CABO-LLG-01-000432  CTRA-DLC-01-000753  FUNCTION Ribosomal protein S9
322         CABO-LLG-01-000433  CTRA-DUW-01-000126  FUNCTION Ribosomal protein L13
323         CABO-LLG-01-000435  CTRA-DUW-01-000152  NA
324         CABO-LLG-01-000448  CTRA-DUW-01-000138  FUNCTION Sua5 homolog
325         CABO-LLG-01-000451  CTRA-DUW-01-000190  ENZYME Thymidylate kinase (dTMP kinase) [EC] 2.7.4.9
326         CABO-LLG-01-000459  CTRA-DUW-01-000217  ENZYME Fructose-bisphosphate aldolase class I [EC] 4.1.2.13
327         CABO-LLG-01-000471  CTRA-DUW-01-000239  FUNCTION acyl carrier protein ACP
328         PACA-UV7-01-000731  CTRA-DUW-01-000105  ENZYME Enoyl-[acyl-carrier protein] reductase (NADH) [EC]
329         CABO-LLG-01-000473  CTRA-DLC-01-000640  ENZYME Malonyl CoA-acyl carrier protein transacylase [EC]
330         CABO-LLG-01-000474  CTRA-DLC-01-000639  ENZYME 3-oxoacyl-[acyl-carrier-protein] synthase III [EC]
331         CABO-LLG-01-000475  CTRA-DUW-01-000243  FUNCTION Recombination protein RecR homolog
332         CABO-LLG-01-000477  CTRA-DUW-01-000245  NA
333         CPNE-TW1-01-000387  CTRA-DUW-01-000055  ENZYME 2-oxoglutarate dehydrogenase E1 component [EC] 1.2.4.2
334         CABO-LLG-01-000486  CTRA-DUW-01-000254  FUNCTION Inner-membrane protein YidC
335         CABO-LLG-01-000489  CTRA-DUW-01-000101  ENZYME holo-[acyl-carrier protein] synthase [EC] 2.7.8.7
336         CABO-LLG-01-000490  CTRA-DUW-01-000100  ENZYME Thioredoxin reductase (NADPH) [EC] 1.6.4.5
337         CABO-LLG-01-000494  CTRA-DUW-01-000096  FUNCTION Ribosome-binding factor A RbfA
338         CABO-LLG-01-000496  CTRA-DUW-01-000094  ENZYME Riboflavin kinase [EC] 2.7.1.26
339         CABO-LLG-01-000499  CTRA-DUW-01-000090  NA
340         CABO-LLG-01-000500  CTRA-DUW-01-000089  NA
341         CABO-LLG-01-000501  CTRA-DLC-01-000793  FUNCTION Ribosomal protein L28
342         CABO-LLG-01-000508  CTRA-DLC-01-000801  ENZYME Methylenetetrahydrofolate dehydrogenase [EC] 1.5.1.15
343         CABO-LLG-01-000509  CTRA-DUW-01-000078  FUNCTION Thiamine biosynthesis lipoprotein ApbE precursor
344         CABO-LLG-01-000510  CTRA-DUW-01-000077  FUNCTION Small protein B SmpB homolog
345         CABO-LLG-01-000511  CTRA-DUW-01-000076  ENZYME DNA polymerase III beta chain [EC] 2.7.7.7
346         CABO-LLG-01-000514  CTRA-DUW-01-000073  SIMILAR-TO zinc protease [EC]
347         CABO-LLG-01-000516  CTRA-DUW-01-000071  FUNCTION ABC transporter, permease protein TroD
348         CABO-LLG-01-000519  CTRA-DUW-01-000068  FUNCTION periplasmic substrate binding protein TroA
349         CPSI-CAL-01-000799  CTRA-DUW-01-000423  FUNCTION high-affinity ZnuA homolog
350         CABO-LLG-01-000520  CTRA-DLC-01-000812  NA



Genes 2012, 3 



315 



Appendix-Table 1. Cont. 



351         CABO-LLG-01-000523  CTRA-DUW-01-000064  ENZYME 6-phosphogluconate dehydrogenase [EC] 1.1.1.44
352         CABO-LLG-01-000524  CTRA-DUW-01-000063  ENZYME Tyrosyl-tRNA Synthetase [EC]
353         CABO-LLG-01-000535  CTRA-DLC-01-000825  NA
354         CABO-LLG-01-000541  CTRA-DLC-01-000??1  NA
355         CABO-LLG-01-000544  CTRA-DUW-01-000045  FUNCTION single-stranded DNA-binding protein SSB
356         CABO-LLG-01-00054?  CTRA-DUW-01-000044  NA
357         CABO-LLG-01-000547  CTRA-DUW-01-000042  NA
358         CABO-LLG-01-000554  CTRA-DLC-01-000619  ENZYME Protein phosphatase 2C [EC] 3.1.3.16
359         CABO-LLG-01-000558  CTRA-DUW-01-000257  NA
360         CABO-LLG-01-000560  CTRA-DUW-01-000108  ENZYME A/G-specific adenine glycosylase [EC] 3.2.2.-
361         CABO-LLG-01-000564  CTRA-DUW-01-000104  NA
362         CABO-LLG-01-000571  CTRA-DUW-01-000268  ENZYME Acetyl-coenzyme A carboxylase carboxyl transferase
363         CABO-LLG-01-000574  CTRA-DUW-01-000271  ENZYME N-acetylmuramoyl-L-alanine amidase [illegible] [EC] 3.5.1.28
364         [illegible]         CTRA-DUW-01-00027?  FUNCTION Penicillin-binding protein [illegible]
365         CABO-LLG-01-000578  CTRA-DUW-01-000274  NA
366         CABO-LLG-01-000581  CTRA-DUW-01-000277  DOMAIN TPR
367         CABO-LLG-01-000585  CTRA-DLC-01-000601  NA
368         CABO-LLG-01-000586  CTRA-DUW-01-0002?2  NA
369         CABO-LLG-01-000587  CTRA-DLC-01-000599  NA
370         CFLC-L58-01-000614  CTRA-DUW-01-000284  NA
371         CABO-LLG-01-000591  CTRA-DLC-01-000597  FUNCTION Glycine cleavage system H protein
372         CABO-LLG-01-000594  CTRA-DUW-01-000288  SIMILAR-TO Lipoate-protein ligase A [EC] 6.3.4.-
373         CABO-LLG-01-000596  CTRA-DUW-01-000290  ENZYME tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase
374         CABO-LLG-01-000601  CTRA-DUW-01-000293  ENZYME Nitrogen regulatory IIA protein A component [EC] 2.7.1.69
375         WCHO-WSU-01-000243  CTRA-DUW-01-000294  ENZYME Nitrogen regulatory IIA protein [illegible] component [EC] 2.7.1.69
376         CABO-LLG-01-000603  CTRA-DUW-01-000295  ENZYME dUTP pyrophosphatase [EC] 3.6.1.23
377         CABO-LLG-01-000608  CTRA-DUW-01-000300  ENZYME Ribonuclease III [EC] 3.1.26.3
378         CABO-LLG-01-000609  CTRA-DLC-01-000581  FUNCTION DNA repair protein RadA
379         CABO-LLG-01-000610  CTRA-DUW-01-000302  ENZYME Porphobilinogen deaminase [EC] 4.3.1.8
380         CABO-LLG-01-00061?  CTRA-DUW-01-000?40  NA
381         CABO-LLG-01-00062?  CTRA-DUW-01-000?4?  DOMAIN DnaJ
382         CABO-LLG-01-000624  [illegible]         FUNCTION Ribosomal protein S21
383         CABO-LLG-01-000628  CTRA-DUW-01-000351  ENZYME Aryl-sulfate sulphohydrolase [EC] 3.1.6.1
384         CABO-LLG-01-000631  CTRA-DUW-01-000354  FUNCTION septum formation protein Maf homolog
385         CABO-LLG-01-000632  [illegible]         NA
386         WCHO-WSU-01-000567  CTRA-DUW-01-000392  NA
387         CABO-LLG-01-000633  CTRA-DUW-01-000356  NA
388         CABO-LLG-01-000636  CTRA-DUW-01-000333  ENZYME Triosephosphate isomerase [EC] 5.3.1.1
389         CABO-LLG-01-000637  CTRA-DUW-01-000334  ENZYME Exonuclease VII large subunit [EC] 3.1.11.6
390         CABO-LLG-01-000641  CTRA-DUW-01-000360  ENZYME Dimethyladenosine transferase [EC] 2.1.1.-
391         CABO-LLG-01-000642  [illegible]         NA
392         CABO-LLG-01-000643  CTRA-DUW-01-000362  DOMAIN Thioredoxin
393         CABO-LLG-01-000646  [illegible]         NA
394         CABO-LLG-01-000647  [illegible]         ENZYME Ribonuclease [illegible] [EC] 3.1.26.4
395         CABO-LLG-01-000651  CTRA-DUW-01-000004  ENZYME glutamyl-tRNA (Gln) amidotransferase, subunit B [EC]
396         CABO-LLG-01-000670  [illegible]         NA
397         CABO-LLG-01-000671  CTRA-DUW-01-000387  SIMILAR-TO metallo-beta-lactamase [EC] 3.5.-.-
398         CABO-LLG-01-000684  CTRA-DLC-01-0004??  NA
399         CABO-LLG-01-000686  CTRA-DUW-01-000396  NA
400         CABO-LLG-01-000690  CTRA-DLC-01-000484  FUNCTION Heat-inducible transcription repressor HrcA
401         CABO-LLG-01-000691  CTRA-DUW-01-000403  FUNCTION GrpE protein
402         [illegible]         CTRA-DUW-01-0004?2  NA
403         CABO-LLG-01-000701  CTRA-DUW-01-000435  NA
404         CABO-LLG-01-000705  CTRA-DUW-01-000438  ENZYME ubiquinone/menaquinone biosynthesis methyltransferase
405         CABO-LLG-01-000706  CTRA-DLC-01-00044?  NA
406         CABO-LLG-01-000707  CTRA-DUW-01-000440  ENZYME Diaminopimelate epimerase [EC] 5.1.1.7
407         CABO-LLG-01-000709  CTRA-DLC-01-000446  ENZYME Serine hydroxymethyltransferase [EC] 2.1.2.1
408         CABO-LLG-01-000713  CTRA-DUW-01-000406  NA
409         CABO-LLG-01-000714  CTRA-DLC-01-000479  NA
410         CABO-LLG-01-000717  CTRA-DUW-01-000410  ENZYME Lipid A 4'-kinase [EC] 2.7.1.130
411         CABO-LLG-01-000722  CTRA-DLC-01-000471  FUNCTION DnaK suppressor protein
412         CABO-LLG-01-000723  CTRA-DUW-01-000416  ENZYME Lipoprotein signal peptidase [EC] 3.4.23.36
413         CABO-LLG-01-000735  CTRA-DUW-01-000427  FUNCTION Ribosomal protein L27
414         CABO-LLG-01-00073?  CTRA-DLC-01-0004?9  FUNCTION Ribosomal protein L21
415         CABO-LLG-01-00073?  CTRA-DUW-01-000444  NA
416         CABO-LLG-01-000739  CTRA-DUW-01-000445  ENZYME Sulfite reductase (NADPH) flavoprotein alpha-component
417         CABO-LLG-01-000740  CTRA-DLC-01-000449  FUNCTION Ribosomal protein S10
418         CABO-LLG-01-000751  CTRA-DUW-01-000456  ENZYME Glutamyl-tRNA Synthetase [EC] 6.1.1.17
419         CABO-LLG-01-000752  CTRA-DLC-01-000431  NA
420 *       CABO-LLG-01-000753  [illegible]         NA
421         CABO-LLG-01-000754  CTRA-DUW-01-000458  ENZYME Single-stranded-DNA-specific exonuclease RecJ [EC]
422         [illegible]         CTRA-DUW-01-0004??  ENZYME Cytidylate kinase [EC] 2.7.4.14
423         CABO-LLG-01-000761  CTRA-DUW-01-000465  ENZYME Arginyl-tRNA Synthetase [EC] 6.1.1.19
424         CABO-LLG-01-000762  CTRA-DUW-01-000466  ENZYME UDP-N-acetylglucosamine 1-carboxyvinyltransferase [EC]
425         CABO-LLG-01-000764  CTRA-DUW-01-0004??  NA
426         CABO-LLG-01-00077?  CTRA-DUW-01-0004?0  NA
427         CABO-LLG-01-000779  CTRA-DUW-01-000481  NA
428         CABO-LLG-01-000784  CTRA-DUW-01-000486  ENZYME Phenylalanyl-tRNA synthetase beta chain [EC] 6.1.1.20
429 *       CABO-LLG-01-000789  CTRA-DUW-01-000491  FUNCTION Dipeptide binding protein DppA
430         CABO-LLG-01-000792  CTRA-DUW-01-000496  NA
431         CABO-LLG-01-000793  CTRA-DUW-01-000497  ENZYME Protoheme ferro-lyase [EC] 4.99.1.1
432         CABO-LLG-01-000794  CTRA-DUW-01-000498  FUNCTION Aminoacid-binding periplasmic protein precursor
433         CABO-LLG-01-000795  CTRA-DUW-01-000499  ENZYME HemK modification methylase homolog [EC]
434         CABO-LLG-01-000796  CTRA-DUW-01-000500  NA
435         CABO-LLG-01-000801  [illegible]         DOMAIN ATP-binding
436         CABO-LLG-01-000802  CTRA-DUW-01-000505  ENZYME DNA polymerase I [EC] 2.7.7.7
437         CABO-LLG-01-000803  CTRA-DLC-01-000384  NA
438         CABO-LLG-01-000805  CTRA-DUW-01-000508  ENZYME CDP-diacylglycerol-glycerol-3-phosphate
439         CABO-LLG-01-000807  CTRA-DUW-01-000511  FUNCTION Glucose inhibited division protein A GidA
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440 


CABO-LLG-01-000808 


CTRA-DUW-0 1-0005 12 


LNZYML Lipoate-protem ligase A [LCj 
6.3.4.- 


441 


CABO-LLG-01-000810 


CTRA-DUW-0 1-0005 14 


LNZYML Holliday Junction DNA Helicase 
RuvA [r,CJ 


442 


CABO-LLG-0 1-0008 11 


CTRA-DUW-0 1-0005 15 


JiJNZYMr, hloiliclay Junction UJNA Helicase 
Rin/r rpri i i 99 a 




L-ArJU-ijijO-Ui -UUUo i J 


r'TR A r»TT\/l/ 01 000^17 


M A 
IN A 


444 


CABO-LLG-0 1-0008 14 


CTRA-DUW-0 1-0005 18 


ENZYME Glyceraldehyde 3 -phosphate 

^\e^\^\\T^\■t'^\c\c^n':\c^e^ \ In 1 11^1 1 / 




PARn T T n 01 000890 


r^TT? A DTTW 01 000S94 
1 l\r\-LJ U VV - W i - WWW JZH- 


PTlATr^TTnTsJ RihACAmal r\rr\fe^\n T 1 S 
r UlN i IVJIN rvlUUoUlllal piULClIl J-; i J 


AAfii 


r'ARO T T n 01 000891 


r'TR A m TW 01 000 9^ 
1 JvA.-iJU W - W i - WWW JZ J 


ruiNL>iiui>i iviDOSOiriai proicin oj 


/ 


r'ARO T T n 01 000899 


r'TR A m TW 01 000'n9A 
1 rvA.-iJU W - Wi - WWWjZO 


PT TMPTTOM RiKACAmQl T\rr\if^\r\ T 1 S 

ruiNL>iiUi>i iviDOSOinai proicin i^i o 




PARO T T n 01 000894 


r'TR A r>TTW 01 000^98 
1 JvA.-iJU W - W i -WWWjZo 


PT lATPTTOM R iV»AC!AmQl T\rr\i(^\r\ 


440 


PARn T T n 01 00089S 


r'TR A DTTW 01 000S9Q 
yy 1 IVrV-U U VV - W i - WWW JZV 


rUlN^^lH^lN ivlUUoUIllai piULClIl L^J 


4^n 


r'ARO T T n 01 00089A 


r'TR A m TW 01 000^10 
1 rvA.-iJU W - Wi -WWWJ jW 


PT TMPTTOM RiKACAmQl T\rr\if^\r\ T lA 

r uiNL> 1 luiN iviDOSOiriai proicin \^z.^■ 


451


CABO-LLG-01-000827


CTRA-DUW-01-000531


FUNCTION Ribosomal protein L14


452


CABO-LLG-01-000828


CTRA-DUW-01-000532


FUNCTION Ribosomal protein S17


453


CABO-LLG-01-000830


CTRA-DUW-01-000534


FUNCTION Ribosomal protein L16


454 


CABO-LLG-01-000831


CTRA-DUW-01-000535


FUNCTION Ribosomal protein S3 




455


CABO-LLG-01-000833


CTRA-DUW-01-000537


FUNCTION Ribosomal protein S19


456 


CABO-LLG-01-000834 


CTRA-DUW-01-000538


FUNCTION Ribosomal protein L2 


457




CTRA-DUW-01-000539


FUNCTION Ribosomal protein L23


458 


CABO-LLG-01-000836 


CTRA-DLC-01-000351


FUNCTION Ribosomal protein L4 


459 


CABO-LLG-01-000837


CTRA-DUW-01-000541


FUNCTION Ribosomal protein L3


460 * 


CABO-LLG-01-000839


CTRA-DUW-01-000543


ENZYME Methionyl-tRNA 
formyltransferase [EC] 2.1.2.9 


461 


CABO-LLG-01-000841


CTRA-DUW-01-000545


ENZYME (3R)-hydroxymyristoyl-[acyl
carrier protein] dehydratase


462 


CABO-LLG-01-000842 


CTRA-DLC-01-000345


ENZYME UDP-3-O-[3-hydroxymyristoyl]
N-acetylglucosamine


463 


CABO-LLG-01-000843


CTRA-DUW-01-000547


ENZYME apolipoprotein N-acyltransferase
[EC] 2.3.1.


464 


CABO-LLG-01-000846 


CTRA-DUW-01-000550


DOMAIN ATP-binding 


465 


CABO-LLG-01-000847


CTRA-DUW-01-000551


NA


466 


CABO-LLG-01-000849


CTRA-DUW-01-000553


ENZYME rRNA methyltransferase SpoU
homolog [EC] 


467 


CABO-LLG-01-000852 


CTRA-DUW-01-000556


ENZYME Histidyl-tRNA synthetase [EC]
6.1.1.21 


468 


CABO-LLG-01-000855 


CTRA-DUW-01-000558


ENZYME DNA polymerase III alpha chain 
[EC] 2.7.7.7 


469 


CABO-LLG-01-000856 


CTRA-DUW-01-000559


NA 


470 


CABO-LLG-01-000857 


CTRA-DLC-01-000331


NA 


471 


CABO-LLG-01-000858 


CTRA-DUW-01-000561


NA 


472 


CABO-LLG-01-000860 


CTRA-DLC-01-000327


ENZYME D-alanyl-D-alanine
carboxypeptidase DacF [EC] 3.4.16.4


473 


CABO-LLG-01-000865 


CTRA-DUW-01-000710


ENZYME Phosphoglycerate kinase [EC]
2.7.2.3


474 


CABO-LLG-01-000867


CTRA-DUW-01-000708


FUNCTION Phosphate transport system 
protein PhoU 


475 


CABO-LLG-01-000874


CTRA-DUW-01-000703


FUNCTION ABC transporter, ATP-binding
protein 


476 


SNEG-ZXX-01-002117 


CTRA-DUW-01-000701


FUNCTION ABC transporter, ATP-binding
protein 


477 


CABO-LLG-01-000880 


CTRA-DUW-01-000697


FUNCTION Ribosomal protein S2 


478 


CABO-LLG-01-000881 


CTRA-DLC-01-000198


FUNCTION Translation elongation factor 
EF-TS 


479


CABO-LLG-01-000882


U 1 KA-U U W -U i -UUUoy J 


ENZYME Uridylate Kinase [EC] 2.7.4.




CABO-LLG-01-000892


CTRA-DUW-01-000684


NA


481 


CABO-LLG-01-000895 


CTRA-DUW-01-000681


DOMAIN FHA 


482 


CABO-LLG-01-000897


CTRA-DUW-01-000679


ENZYME glutamyl-tRNA reductase [EC]
1.2.1.


483 


CABO-LLG-01-000904


CTRA-DUW-01-000672


ENZYME KDO-8-phosphate synthetase
[EC] 4.1.2.16 


484


CABO-LLG-01-000912


CTRA-DLC-01-000234


NA


485 


CABO-LLG-01-000913


CTRA-DUW-01-000640


SIMILAR-TO Endonuclease IV [EC]
3.1.21.2


486 


CABO-LLG-01-000914


CTRA-DUW-01-000641


FUNCTION Ribosomal protein S4 


487 


CABO-LLG-01-000916


CTRA-DUW-01-000657


FUNCTION Multidrug-efflux transporter


488 


CABO-LLG-01-000917


CTRA-DLC-01-000238


ENZYME Exodeoxyribonuclease V gamma 
subunit [EC] 3.1.11.5 


489 


CABO-LLG-01-000920


CTRA-DUW-01-000652


SIMILAR-TO Amino-acid
aminotransferase class I [EC] 2.6.1. 


490 


CABO-LLG-01-000921


CTRA-DUW-01-000651


FUNCTION Transcription Elongation 
Factor GreA C-terminus 


491 


CABO-LLG-01-000923


CTRA-DUW-01-000649


NA 


492 


CABO-LLG-01-000924


CTRA-DLC-01-000245


ENZYME Porphobilinogen synthase [EC] 
4.2.1.24 



* Eight clusters do not contain exactly one member per genome and are marked; these include
cluster 420, which does not contain a master sequence from the original C. trachomatis annotation
dataset. NA: not available. The lead sequence is the first sequence found in the cluster; the master
sequence is the sequence from which the annotation is drawn (see Experimental Section for details).